feat: Add file based Hugging Face datasets to kedro-datasets#1373

Open
iwhalen wants to merge 21 commits into kedro-org:main from iwhalen:feat/add-local-hf-dataset

Conversation

@iwhalen
Contributor

@iwhalen iwhalen commented Apr 5, 2026

Description

Adds datasets for interacting with Hugging Face datasets on a file system.

Development notes

Added docs, tests, ran in a fresh pipeline.

Iterable and in-memory versions have both been tested as well.

Note

I couldn't figure out a good way to save an IterableDataset without looping through it entirely first.

Maybe there's a better way someone knows about.
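A minimal stdlib sketch of the only approach I found, with `stream_records` as a hypothetical stand-in for an `IterableDataset` (names and the CSV target are illustrative, not the actual implementation): the stream has to be materialized before anything can be written.

```python
import csv
import io

def stream_records():
    # Hypothetical stand-in for an IterableDataset: yields rows lazily
    # and can only be consumed once.
    for i in range(3):
        yield {"id": i, "text": f"row {i}"}

def save_iterable_as_csv(records, fileobj):
    # Materialize first: csv.DictWriter needs the field names up front,
    # so the whole stream ends up in memory before anything is written.
    rows = list(records)
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)

buf = io.StringIO()
n = save_iterable_as_csv(stream_records(), buf)
```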

Updated jsonschema/kedro-catalog.1.00.json.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@iwhalen iwhalen changed the title Feat/add local hf dataset feat: Add huggingface.LocalHFDataset to kedro-datasets Apr 5, 2026
@iwhalen iwhalen marked this pull request as ready for review April 5, 2026 20:48
Member

@deepyaman deepyaman left a comment


High-level concerns:

  1. Use of fsspec.
  2. Bespoke split/partition/multi-file handling (whatever you want to call it).
  3. Handling too many types in one.

I think I'd personally rather it be a lightweight wrapper that delegates more to the underlying Hugging Face APIs.

Unless you're convinced, or if you disagree, it's probably worth getting a second opinion/review on the above.

Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
if protocol == "file":
_fs_args.setdefault("auto_mkdir", True)

self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)
Member


The problem with this approach is that Hugging Face also supports remote URIs natively (e.g. https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_from_disk). I think we don't want to use fsspec in those cases, because Hugging Face very well could have a more efficient native path.

This is a hard problem to solve; we have similar concerns on the Ibis side (e.g. #1298), but intuitively I'd err on the side of leaving it up to the Hugging Face APIs.

Contributor Author

@iwhalen iwhalen Apr 11, 2026


See comment below on needing a filesystem.

In order to save a DatasetDict we have to be able to access it.

Hugging Face uses fsspec under the hood anyway 🤷


if self._fs.isdir(load_path):
paths = {
PurePosixPath(p).stem: p for p in self._fs.glob(f"{load_path}/*{ext}")
Member


Does it have to have the extension, or is this too narrow? Will people come saying they have HDF5 files with .hdf5 extension instead of .h5?

Member


Actually, look at the C4 example under https://huggingface.co/docs/datasets/en/loading#hugging-face-hub; there's an example with .json.gz extension.

Contributor Author


Hmmm... maybe I'll just glob everything in the directory then?
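A stdlib sketch of the split-discovery pattern under discussion (`discover_splits` is my illustrative name, not the PR's code); it also shows the double-extension pitfall from the C4 example, since `.stem` strips only the final suffix.

```python
from pathlib import PurePosixPath

def discover_splits(paths):
    # Map each file found in a directory to a split name taken from its
    # filename. Note: .stem strips only the final suffix, so a double
    # extension like train.json.gz yields "train.json", not "train".
    return {PurePosixPath(p).stem: p for p in paths}

splits = discover_splits(
    ["data/train.csv", "data/test.csv", "data/valid.json.gz"]
)
```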

}
return DatasetDict(
{
split: loader(path, **self._load_args)
Member


It seems you can specify splits in the data_files argument of load_dataset (e.g. https://huggingface.co/docs/datasets/loading#json); would this be preferable to constructing the DatasetDict with multiple Dataset.from_* calls? I haven't looked into the implementation, but I'd be curious if there's a good reason to not use the higher-level API.

Referencing the same C4 example from above (https://huggingface.co/docs/datasets/en/loading#hugging-face-hub), it seems you can also pass the wildcard directly under data_files.

Contributor Author

@iwhalen iwhalen Apr 11, 2026


There are two things happening here:

  • Loading a dataset from the Hub (handled by HFDataset now; it's not perfect, but I'm ignoring it in this PR).
  • Loading / saving from a filesystem (what I'm trying to implement here).

Loading

load_dataset can handle all the types we need with a call like:

load_dataset("csv", data_files="path/to/data.csv")

Same goes for parquet, lance, hdf5, json.

Arrow datasets don't play nice like this though and use load_from_disk("path/to/arrow").

Saving

This procedure works for everything but Arrow:

data = Dataset.from_dict({"a": [1,2], "b": [3,4]})
data.to_parquet("data.parquet")
load_dataset("parquet", data_files="data.parquet")

I couldn't find a way to save a DatasetDict to a particular format.

This loop is what happens in the DatasetDict itself here:

https://github.com/huggingface/datasets/blob/4775eeba2d5e73349790f7575182d71d5cd8e1bf/src/datasets/dataset_dict.py#L1376-L1383

Which eventually makes it down to this private method, which will only save to Arrow format.

The same goes for individual Dataset objects: you have to use the to_<format> methods; there's no single method to rule them all like there is for loading. Lance and HDF5 don't actually have save methods handled by Hugging Face :(

Agree with your point on file name format strictness though.
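A stdlib simulation of that per-split loop (plain dicts and the csv module stand in for DatasetDict and to_csv; `save_dataset_dict` is my illustrative name):

```python
import csv
import tempfile
from pathlib import Path

def save_dataset_dict(splits, save_dir, ext=".csv"):
    # One file per split, named after the split key -- the same shape
    # as the loop inside datasets' DatasetDict save path.
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split, rows in splits.items():
        path = save_dir / f"{split}{ext}"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path.name)
    return written

tmp = tempfile.mkdtemp()
files = save_dataset_dict(
    {"train": [{"a": 1, "b": 3}], "test": [{"a": 2, "b": 4}]}, tmp
)
```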

glob_function=self._fs.glob,
)

def _load(self) -> DatasetLike:
Member


How do you get an IterableDataset, for example? At the bottom of https://huggingface.co/docs/datasets/en/filesystems, it seems like you could do IterableDataset.from_dict(), but it's not clear you're ever doing that, right?

Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
Comment thread kedro-datasets/RELEASE.md Outdated
@iwhalen
Contributor Author

iwhalen commented Apr 11, 2026

High-level concerns:

1. Use of fsspec.

2. Bespoke split/partition/multi-file handling (whatever you want to call it).

3. Handling too many types in one.

I think I'd personally rather it be a lightweight wrapper that delegates more to the underlying Hugging Face APIs.

Unless you're convinced, or if you disagree, it's probably worth getting a second opinion/review on the above.

Thanks for the review @deepyaman! This was a little sloppy, you're right.

My first thought was also to try to get the Hugging Face API to handle everything. I'll try again on this and update.

Otherwise, I'll just make a separate dataset for each format HF supports.

Thanks again!

@iwhalen iwhalen force-pushed the feat/add-local-hf-dataset branch 2 times, most recently from 130d51c to 28c80df on April 11, 2026 16:37
@iwhalen iwhalen requested a review from deepyaman April 11, 2026 16:40
@iwhalen
Contributor Author

iwhalen commented Apr 11, 2026

Ok changes are all up. I think this is a bit better.

  • Each dataset has its own file.
  • Saving and loading are a little simpler.

Unchanged:

  • Looping over DatasetDict objects to save (have to, see comment).
  • Awkward isinstance checks to control behavior for DatasetDict and Iterable* types.

Contributor

@ElenaKhaustova ElenaKhaustova left a comment


Thanks for the contribution and for being so responsive to feedback! The split into separate dataset classes makes sense, and it matches how the rest of kedro-datasets is organized (e.g. pandas.CSVDataset, pandas.ParquetDataset). Also, Arrow genuinely behaves differently from the other formats.

A few high-level suggestions before we iterate on the details:

Scope this PR to the four round-trip formats. I'd suggest removing HDF5Dataset and LanceDataset (and their tests/docs) from this PR. Focus on Arrow, Parquet, CSV, and JSON — these all support full save + load round-trips and share the same base class, so they belong together. The read-only formats are a separate concern that deserves a small follow-up PR where we can get the save() error handling right (e.g. overriding save() directly so the error is raised immediately, rather than letting the base class do type checking, iterable materialization, and path resolution before the error surfaces in _save_dataset).

Deduplicate the tests. The test files for CSV, JSON, and Parquet are near-identical (differing only in class name and extension). Consider a parametrized shared test instead.

Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py
@iwhalen iwhalen force-pushed the feat/add-local-hf-dataset branch 3 times, most recently from 0b4bb94 to 212769f on April 24, 2026 01:42
@iwhalen
Contributor Author

iwhalen commented Apr 24, 2026

@ElenaKhaustova Ok! I think I addressed most of the changes. There are still a couple of things I'm not in love with.

I see my tests are failing, so I'll get to those. Just wanted to send a note on some higher level things.

Checking for directory in non-Arrow datasets

In the FilesystemDataset we have to make an assumption somewhere on whether or not we're trying to load a DatasetDict.

I thought it was reasonable that, if the provided path looks like a directory (or is an existing directory), we assume it's a directory.

Then, if the user doesn't tell us what we're looking for in the directory, we throw an error.

That's what's happening here in _validate_load_paths.

In other words, we can't call load_dataset("json", data_files="my/directory/"); we have to call load_dataset("json", data_files={"data": "my/directory/data.json", "labels": "my/directory/labels.json"}).

I'm not sure if there's a smarter way to do this. Let me know what you think.
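A stdlib sketch of that assumption (function names here are mine for illustration, not the PR's actual `_validate_load_paths`):

```python
from pathlib import PurePosixPath

def looks_like_directory(path):
    # Treat a path as a directory if it ends in a separator or its last
    # component has no extension (e.g. "data/01_raw/reviews").
    return path.endswith("/") or PurePosixPath(path).suffix == ""

def validate_load(path, data_files):
    # If the path looks like a directory, the caller must say which
    # files inside it to load; there is no reliable way to guess.
    if looks_like_directory(path) and not data_files:
        raise ValueError(f"'{path}' looks like a directory; pass data_files.")

is_file = not looks_like_directory("data/01_raw/reviews/data.json")
```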

data_files convenience processing for non-Arrow datasets

As I say above, the right way to load from a directory is:

load_dataset(
    "json",
    data_files={"data": "my/directory/data.json", "labels": "my/directory/labels.json"},
)

Where the values in the data_files dictionary are full paths to your data.

If we were 100% strict, our yaml entries for this dataset would have to look like:

reviews:
  type: huggingface.JSONDataset
  path: data/01_raw/reviews
  load_args:
    data_files:
      labels: data/01_raw/reviews/labels.json
      data: data/01_raw/reviews/data.json

Instead, I introduce the helper _build_data_files so that the yaml entries can look like:

reviews:
  type: huggingface.JSONDataset
  path: data/01_raw/reviews
  load_args:
    data_files:
      labels: labels.json
      data: data.json

Maybe this is overstepping! Happy to hear thoughts on this as well.
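A stdlib sketch of that convenience behaviour (my illustrative version, not the PR's actual `_build_data_files`): bare filenames get joined onto the dataset's base path, while values that already contain a separator pass through unchanged.

```python
from posixpath import join

def build_data_files(base_path, data_files):
    # Join bare filenames onto the dataset's base path so catalog
    # entries can stay short; values that already contain a separator
    # are assumed to be full paths and left untouched.
    return {
        split: name if "/" in name else join(base_path, name)
        for split, name in data_files.items()
    }

resolved = build_data_files(
    "data/01_raw/reviews", {"labels": "labels.json", "data": "data.json"}
)
```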

@iwhalen iwhalen requested a review from ElenaKhaustova April 24, 2026 12:29
Contributor

@ElenaKhaustova ElenaKhaustova left a comment


Thanks for the updates @iwhalen — this is much improved!

On your two questions:

Directory loading with data_files: You're right — my earlier suggestion to use data_dir was too optimistic about what HF handles automatically for non-Arrow formats. Your approach of requiring explicit data_files is the correct middle ground: it removes the fragile glob-based discovery while still delegating the actual loading to load_dataset.

_build_data_files helper: this is a good UX call.

A few remaining items in inline comments below.

Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/_base.py
Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
Comment thread kedro-datasets/kedro_datasets/huggingface/arrow_dataset.py Outdated
@iwhalen iwhalen force-pushed the feat/add-local-hf-dataset branch from 5910c0a to 8dbdc32 on April 29, 2026 23:02
@iwhalen
Contributor Author

iwhalen commented Apr 29, 2026

@ElenaKhaustova seems like we're getting closer!

I addressed the changes you had above and fixed my doctest issues.

Now I'm seeing some failures in CI that seem to be unrelated to the changes in this branch.

Any advice?

@iwhalen iwhalen requested a review from ElenaKhaustova April 29, 2026 23:09
Contributor

@ElenaKhaustova ElenaKhaustova left a comment


Thanks @iwhalen, this looks great now! All the feedback from the previous round has been addressed cleanly:

  • __init__ is now free of filesystem calls and validation logic, matching the convention used elsewhere in kedro-datasets
  • _save_dataset_dict now validates that data_files keys match the DatasetDict split names — much better than a silent "file not found" at load time
  • RuntimeError replaced with DatasetError for consistency
  • DatasetLike is now defined once in _base.py
  • The Arrow docstring (and the others) correctly describe the iterable behavior now

The CI failure is actually from this PR — coverage dropped to 99.88% and the project requires 100%. The uncovered lines are:

  • _base.py:199-200 — the except DatasetError: return False branch in _exists()
  • arrow_dataset.py:78-79 — the same branch in Arrow's _exists() override

This branch fires when _get_load_path() can't resolve a path — i.e. a versioned dataset with no saved versions yet. Adding a test like this to both test_filesystem_datasets.py and test_arrow_dataset.py should cover it:

def test_exists_no_versions(self, kedro_dataset_cls, path_file):
    """`exists()` returns False (not raises) when no versions are saved yet."""
    ds = kedro_dataset_cls(path=path_file, version=Version(None, None))
    assert ds.exists() is False

Comment thread kedro-datasets/kedro_datasets/huggingface/hugging_face_dataset.py Outdated
@ElenaKhaustova ElenaKhaustova requested a review from ankatiyar May 11, 2026 17:17
@iwhalen
Contributor Author

iwhalen commented May 12, 2026

@ElenaKhaustova Fixed some more issues, including a weird one with chromadb dependencies on python 3.14.

It seems like this was known: chroma-core/chroma#2571

(Failure was here).

@ElenaKhaustova
Contributor

@iwhalen, thank you for addressing the remaining comments!

It looks like the failure isn't from your HF changes. All unit-tests jobs (where the HF datasets live) are passing on every platform/Python combination, including Windows + 3.14.

The only failing job is unit-tests-experimental (windows-latest, 3.14), which runs the tests under kedro_datasets_experimental/. Looking at the PR diff, your branch includes two unrelated files that look like they came in via a rebase:

  • kedro_datasets_experimental/opik/opik_evaluation_dataset.py
  • kedro_datasets_experimental/tests/opik/test_opik_evaluation_dataset.py

That Opik test suite is failing specifically on Windows + Python 3.14 (it passes on every other matrix entry). That's a pre-existing main-branch issue, not something introduced here.

Could you rebase fresh on main to drop the unrelated Opik changes (and any other unrelated files) from your diff? If the Windows + 3.14 experimental job still fails after that, we'll look into it separately on our side in parallel.

@ankatiyar
Contributor

The failures on experimental datasets will not block merging PRs btw, so no worries about that.

@iwhalen iwhalen force-pushed the feat/add-local-hf-dataset branch from f842e49 to 8beeb3d on May 12, 2026 11:56
@iwhalen
Contributor Author

iwhalen commented May 12, 2026

@ElenaKhaustova Opik changes removed! Not sure how they got in there...

Thanks again for all the help.

@iwhalen iwhalen changed the title from "feat: Add huggingface.LocalHFDataset to kedro-datasets" to "feat: Add file based Hugging Face datasets to kedro-datasets" on May 12, 2026
Comment thread kedro-datasets/pyproject.toml Outdated
save_path = get_filepath_str(self._get_save_path(), self._protocol)

if isinstance(data, DatasetDict):
self._save_dataset_dict(data, save_path)
Contributor


QQ: Do you need a data_files dict here to specify filepaths?

Contributor


Sorry I see below it's being read from _load_args. Should this be read from save_args instead?

Contributor Author

@iwhalen iwhalen May 12, 2026


This is a really good point actually... it should be read from save_args.

However, the way it's implemented right now would require a user to specify data_files in both the load and save args.

Concretely, it would have to look like this in yaml:

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews
  load_args:
    data_files:
      labels: labels.csv
      data: data.csv
  save_args:
    data_files:
      labels: labels.csv
      data: data.csv

Which isn't the most user friendly... hmm...

I guess there are two options: swap the saving operation to read from save_args and require it to be specified twice, or allow data_files as a top-level argument to the dataset.

In this second option,

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews
  data_files:
    labels: labels.csv
    data: data.csv

would work and so would specifying data_files in both the save and load args, but something like:

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews
  data_files:
    labels: labels.csv
    data: data.csv
  save_args:
    data_files:
      labels: labels.csv
      data: data.csv

would throw an error (regardless of whether the filenames match).

What do you think @ankatiyar? Or maybe there's another option you'd prefer.

Contributor


If this information is used for both saving and loading, I don't mind if it's a top-level key. You're right that having to specify it for both load and save args might be tedious, but maybe templating (with OmegaConfigLoader) could help with that option, and the approach can be documented:

_data_files:
  labels: labels.csv
  data: data.csv

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews
  load_args:
    data_files: ${_data_files}
  save_args:
    data_files: ${_data_files}

I'm okay with either approach, just using load_args for saving stuff feels not quite right. WDYT? @ElenaKhaustova

Contributor


+1 to @ankatiyar that using load_args for save behaviour feels off. I'd go with Option B (top-level data_files) rather than duplicating it across load_args/save_args:

  • data_files describes the directory layout, which both load and save need to agree on — same shape as path, which is also top-level.
  • Single source of truth, no risk of load_args and save_args drifting apart.
  • Surfaces it in the constructor signature instead of hiding it inside an opaque load_args dict.
  • OmegaConf templating works, but relies on users knowing the trick and still produces visually duplicated YAML.

While you're in this code, there's also a related bug worth fixing: _save_dataset_dict currently ignores the filename values in data_files on save — it synthesises f"{save_path}/{split}{ext}" regardless. So:

load_args:
  data_files:
    labels: my_labels.csv    # silently ignored on save
    data:   my_data.csv

…will save to labels.csv / data.csv but then load looks for my_labels.csv / my_data.csv → round-trip fails. The current validation only checks keys, not values, and the test fixtures all use {split}{ext} filenames so this never gets exercised.

So my suggestion:

  1. Promote data_files to a top-level constructor arg.
  2. Raise a clear error if it also appears in load_args or save_args.
  3. Make the save loop use f"{save_path}/{data_files[split]}" so filenames round-trip.
  4. Add a test with a custom filename (e.g. {"train": "my_train.csv"}) to lock the behaviour in.
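The four steps above can be sketched in stdlib Python (both function names are mine for illustration, not the PR's actual code):

```python
from posixpath import join

def resolve_data_files(data_files, load_args=None, save_args=None):
    # Single source of truth: reject data_files if it also appears in
    # load_args or save_args, so the two can never drift apart.
    if data_files and (
        "data_files" in (load_args or {}) or "data_files" in (save_args or {})
    ):
        raise ValueError("Pass data_files at the top level only.")
    return data_files

def save_paths(save_path, data_files):
    # Use the user-supplied filename for each split, so whatever save
    # writes is exactly what load will look for (round-trip safety).
    return {split: join(save_path, fname) for split, fname in data_files.items()}

files = resolve_data_files({"train": "my_train.csv"})
paths = save_paths("data/01_raw/reviews", files)
```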

Contributor Author


Great! Agreed. I'll do this.

@iwhalen iwhalen force-pushed the feat/add-local-hf-dataset branch from 4bf2680 to ce98b10 on May 12, 2026 21:18
iwhalen and others added 19 commits May 12, 2026 16:19
@iwhalen iwhalen force-pushed the feat/add-local-hf-dataset branch from ce98b10 to adbf9a4 on May 12, 2026 21:19
@iwhalen
Contributor Author

iwhalen commented May 13, 2026

@ElenaKhaustova @ankatiyar thanks again for all the feedback!

There's a seemingly unrelated segfault happening now, but the changes are up.

The only place I had to make an arbitrary decision was in the case that load_args["data_files"] != save_args["data_files"]. In this case, I just do nothing and let the user save and load to different files if they want to.

It's a weird edge case, but I think we want people to use the top-level data_files argument anyway 🤷

Member

@deepyaman deepyaman left a comment


I just took a quick pass, but LGTM overall; one nit comment, and then agree with some feedback you already are going to address.

Comment on lines 10 to 11
# For documentation builds that might fail due to dependency issues
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
Member


I know this is dumb, but can we include this in each block, in the spirit of blind consistency with how the other datasets are? 😅

"""

BUILDER: ClassVar[str] = "parquet"
EXTENSION: ClassVar[str] = ".parquet"
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, this extension is only used if you don't specify save_data_files? Does this also work for loading e.g. .parquet.snappy or some other sorts of extensions?

I think so, but want to be sure.
